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Abstract 

We consider a finite mixture of regressions (FMR) model for high-dimensional in- 
homogeneous data where the number of covariates may be much larger than sample 
size. We propose an ^i-penalized maximum likelihood estimator in an appropriate 
parameterization. This kind of estimation belongs to a class of problems where opti- 
mization and theory for non-convex functions is needed. This distinguishes itself very 
clearly from high-dimensional estimation with convex loss- or objective functions, as 
for example with the Lasso in linear or generalized linear models. Mixture models 
represent a prime and important example where non-convexity arises. 

For FMR models, we develop an efficient EM algorithm for numerical optimiza- 
tion with provable convergence properties. Our penalized estimator is numerically 
better posed (e.g., boundedness of the criterion function) than unpenalized maximum 
likelihood estimation, and it allows for effective statistical regularization including 
variable selection. We also present some asymptotic theory and oracle inequalities: 
due to non-convexity of the negative log-likelihood function, different mathematical 
arguments are needed than for problems with convex losses. Finally, we apply the 
new method to both simulated and real data. 

Keywords Adaptive Lasso, Finite mixture models. Generalized EM algorithm. High- 
dimensional estimation. Lasso, Oracle inequality 

This is the authors version of the work (published as a discussion paper 
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1 Introduction 

In applied statistics, tremendous number of applications deal with relating a random re- 
sponse variable y to a set of explanatory variables or covariates X = {X^^\ . . . , X^^^) 
through a regression- type model. The homogeneity assumption that the regression coef- 
ficients are the same for different observations (Yi,Xi), . . . , (1^,X„) is often inadequate. 
Parameters may change for different subgroups of observations. Such heterogeneity can 
be modeled with a Finite Mixture of Regressions (FMR) model. Especially with high- 
dimensional data, where the number of covariates p is much larger than sample size n, 
the homogeneity assumption seems rather restrictive: at least a fraction of covariates 
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may exhibit a different influence on the response among various observations (i.e., sub- 
populations). Hence, addressing the issue of heterogeneity in high-dimensional data is 
important in many practical applications. We will empirically demonstrate with real data 
in Section 17.21 that substantial prediction improvements are possible by incorporating a 
heterogeneity structure to the model. 

We propose here an i'l-penalized method, i.e., a Lasso-type estimator ( Tibshirani . 19961 ). 
for estimating a high-dimensional Finite Mix ture of Regression s (FM R) model where p ^ 
n. Our procedure is related to the proposal in lKhalili and Chen! (j2007l ). However, we argue 
that a different parameterization leads to more efficient computation in high-dimensional 
optimization for which we prove numerical convergence properties. Our algorithm can 
easily handle problems where p is in the thousands. Furthermore, regarding statistical 
performance, we present a n oracle inequalit y whic h includes the setting where p ^ n: 
this is very different from iKhalili and ChenI (j2007l ) who use fixed p asymptotics in the 
low-dimensional framework. Our theory for deriving oracle inequalities in the presence of 
non-convex loss functions, as the negative log-likelihood in a mixture model is non-convex, 
is rather non-standard but sufficiently general to cover other cases than FMR models. 

From a more general point of view, we show in this paper that high-dimensional esti- 
mation problems with non-convex loss functions can be addressed with high computa- 
tional efficiency and good statistical accuracy. Regarding the computation, we develop 
a rather generic block coordinate descent generalized EM algorithm which is surprisingly 
fast even for la ,rge p. Progr ess in efficient gradient desc ent methods build on various de- 
velopments by [Tseng (l200lh and IXseng and YmJ (jgOOSL and th eir use for solving Lasso - 
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Meier et al.\ (|2008l ) and iFriedman et al.\ ((20101). We present in Section O some com 



type convex probl ems h as been worked out by , e.g., Fu ( 19981 ). Friedman et al. (j2007l ). 



putation times for the more involved case with non-convex objective function using a 
block coordinate descent generalized EM algorithm. Regarding statistical theory, almost 
all results for high-dimensional Lasso-type problems have been developed for conv e x loss 
functions, e.g., the squared error in a Gaussian regression (IGreenshtein and Ritovl. l2004l: 
Meinshausen a nd Biihlmann . '2006'; 'Zhao and Yu', '2006'; 'Bune a al.l.l2007l:lzha ng and Huan3. 



2008: Meinshausen and Yd. 12009: Wa inwrieht. 2009: Bick el et al 



2009 



Cai et al.. . 2009b : 



Candes and Planl. l2009l: IZhangj . l2009l ) or the negative log-likelihood in a generalized linear 
model ( van de Geerl . 2008 ). We present a non-trivial modification of the mathematical 
analysis of £i-penalized estimation with convex loss to non-convex but smooth likelihood 
problems. 

When estimation is defined via optimization of a non-convex objective function, there is 
a major gap between the actual computation and the procedure studied in theory. The 
computation is typically guaranteed to find a local optimum of the objective function only, 
whereas the theory is usually about the estimator defined by a global optimum. Partic- 
ularly in high-dimensional problems, it is difficult to compute a global optimum and it 
would be desirable to have some theoretical properties of estimators arising from local op- 
tima. We do not provide an answer to this difficult issue in this thesis. The beauty of, e.g., 
the Lasso or the Dantzig selector (jCandes and Tad . 120071 ) in high-dimensional problems 
is the provable correctness or optimality of the estimator which is actually computed. A 
challenge for future research is to establish such provable correctness of estimators involv- 
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ing non-convex objective functions. A noticeable exception is presented in IZhand tom ) 
for linear models, where some theory is derived for an estimator based on a local optimum 
of a non-convex optimization criterion. 



The rest of this article is mainly focusing on Finite Mixture of Regressions (FMR) models. 
Some theory for high-dimensional estimation with non-convex loss functions is presented 
in Section [5] for more general settings than FMR models. The further organization of 
the paper is as follows: Section [2] describes the FMR model with an appropriate param- 
eterization, Section [3] introduces £i-penalized maximum-likelihood estimation. Sections H] 
and [5] present mathematical theory for the low- and high-dimensional case. Section [6] de- 
velops some efficient generalized EM algorithm and describes its numerical convergence 
properties, and Section [7] reports on simulations, real data analysis and computational 
timings. 

2 Finite mixture of Gaussian regressions model 

Our primary focus is on the following mixture model involving Gaussian components: 

Yi\Xi independent for i = 1, . . . , n, 
Yi\Xi = x ~ f^{y\x)dy for z = 1, . . . , n, 

^ 1 {y-x^j3..)'^ 

hiv\-) = i:-r^^^M ^^), (1) 

i = (/?!,..., /3fc,(Ji,...,cJfc,7ri,...,7rfc_i) G M'^p x M^q x H, 

fc-i 

n = {tt; VTr > for r = 1, . . . , A; — 1 and 7r.r < 1}. 

r=l 

Thereby, Xi € W are fixed or random covariates, 1^ E M is a univariate response variable 
and = . . . , di, . . . , Ufc, vri, . . . , vr^.i) denotes the {p + 2)-k — l free parameters and 
TTfc is given by tt^ = 1 — T^r- The model in ([T]) is a mixture of Gaussian regressions, 

where every component r has its individual vector of regression coefficients f3r and error 
variances cj^. We are particularly interested in the case where p ^ n. 

2.1 Reparameterized mixture of regressions model 

We prefer to work with a reparameterized version of model ([T]) whose penalized maximum 
likelihood estimator is scale-invariant and easier to compute. The computational aspect 
will be discussed in greater detail in Sections 13.11 and [H Define new parameters 

(j)r = f3r/(Tr, Pr = CT^^ , r=l,...,k. 

This yields a one-to-one mapping from ^ in (JT]) to a new parameter vector 

6 = {(pi, . . . ,4)k, Pi, ■ ■ ■ ,Pk, T^i, ■ ■ ■ ,7rfc-i), 
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and the model ([T]) in reparameterized form then equals: 

Yi\Xi independent for i = 1, . . . , n, 
Yi\Xi = X ~ hg{y\x)dy for z = 1, . . . , n, 



k 

he{y\x) = ^7r^-^exp(-- 

r=l 



{pry - X^C^rf) (2) 



= ((/.i,...,(/.fc,pi,...,Pfc,7ri,...,7rfc_i) eM'^'P xM^o x H 
n = {vr; TTr > for r = 1, . . . , /c — 1 and vr^ < 1}, 

r=l 

with VTfc = 1 — X^^=i This is the main model we are analyzing and working with. 

The log-likelihood function of this model equals 

n / k \ 
£(0;y) = ^log j;^,-^exp(--(p,y,-Xf0,)2) . (3) 

i=l \r=l ^ / 

Since we want to deal with the p ^ n case, we have to regularize the maximum likelihood 
estimator (MLE) in order to obtain reasonably accurate estimates. We propose below 
some £i-norm penalized MLE which is different from a naive ^i-norm penalty for the MLE 
in the non-transformed model ([1]). Furthermore, it is well known that the (log-) likelihood 
function is generally unbounded. We will see in Section 13.21 that our penalization will 
mitigate this problem. 

3 ^i-norm penalized maximum likelihood estimator 

We argue first for the case of a (non-mixture) linear model why the reparameterization 
above in Section [2.11 is useful and quite natural. 

3.1 ^i-norm penalization for reparameterized linear models 

Consider a Gaussian linear model 



= ^/3,Xp^+e,, i = l,...,n, 



£!,...,£„ i.i.d. ^M{0,a^), (4) 
where Xi are either fixed or random covariates. In short, we often write 

y = X/3 + e, 

with n X 1 vectors Y and e, pxl vector /3 and nxp matrix X. In the se quel, || • || denotes t he 
Euclidean norm. The ^i-norm penalized estimator, called the Lasso ( Tibshirani ( 19961 )). 
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is defined as 



Px = argminn"^||y - X/3f + aV |/3j|. (5) 

Here A is a non- negative regularization parameter. Tlie Gaussian assumption is not cru- 
cial in model (jH) but it is useful to make connections to the likelihood framework. The 
Lasso estimator in ([5]) is equivalent to minimizing the penalized negative log-likelihood 
n~^£{f3] Yi, . . . , Yn) as a function of the regression coefficients /3 and using the ^i-penalty 
||/3||i = Yl^=i l/^il- equivalence here means that we obtain the same estimator for a po- 
tentially different tuning parameter. But the Lasso estimator in 1^ does not provide an 
estimate of the nuisance parameter cr^. 

In mixture models, it will be crucial to have a good estimator of o"^ and the role of the 
scaling of the variance parameter is much more important than in homogeneous regression 
models. Hence, it is important to take into the definition and optimization of the 
penalized maximum likelihood estimator: we could proceed with the following estimator, 

/3a,<tI = aTgmm-n-^£{(3,a^;Yi, . . . ,y„) + A||/3||i 

/3,o-2 

= argminlog(a) + ||y - X^f/{2na^) + A||/3||i. (6) 

Note that we are penalizing only the /3-parameter. However, the scale parameter estimate 
(t| is influenced indirectly by the amount of shrinkage A. 

There are two main drawbacks of the estimator in ([6]). First, it is not equivariant 



(iLehmannl . Il983l ) under scaling of the response. A/[ore precisely, consider the transfor- 



mation 

Y' = bY, 13' = bp, a' = ba (6 > 0) (7) 

which leaves model ([4]) invariant. A reasonable estimator based on transformed data Y' 
should lead to estimators f3',a' which are related to f3,a through /3' = 6/3 and a' = ba. 
This is not the case for the estimator in ([6]). Secondly, and as important as the first 
issue, the optimization in ([6]) is non-convex and hence, some of the major computational 
advantages of Lasso for high-dimensional problems is lost. We address these drawbacks 
by using the penalty term A^^^!^ leading to the following estimator: 

^2 



I3x,ai = argminlog(a) + ||y - X/3|| V(2na') + A 

This estimator is equivariant under the scaling transformation ([7|), i.e., the estimators a' 
based on Y' transform as /3' = b(3 and a' = ba. Furthermore, it penalizes the ^i-norm of 
the coefficients and small variances simu ltaneously which has some close relations to 
the Bayesian Lasso ( Park and Casella . 20081 ) . For the latter, a Bayesian approach is used 



with a conditional Laplace prior specification of the form 

A J/3,| 
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and a noninformative scale- invariant marginal prior p{cr^) = l/cr^ for u^. iPark and Casella 
argue that conditioning on cr^ is important because it guarantees a unimodal full 
posterior. 

Most importantly, we can re-parameterize to achieve convexity of the optimization problem 

This then yields the following estimator which is equivariant under scaling and whose 
computation involves convex optimization: 



h,Px = argmin-log(p) + ^\\pY - X(/>f + A 
4>,p 2n 



(8) 



Prom an algorithmic point of view, f ast algorithms are available to solve the optimization in 
dH). Shooting algorithms 
as dem onstrated by, e.g. 



( 


20071). Meier et al. 


( 


20081) or 



Friedman et al. 



model, a more complex task than the optimization for ([8]). As we will see in Section 
6.H we will make use of the Karush-Kuhn- Tucker (KKT) conditions in the M-Step of 
a generalized EM algorithm. For the simpler criterion in ([8]) for a non-mixture model, 
the KKT conditions imply the following which we state without a proof. Denote by (•, •) 
the inner product in n-dimensional Euclidean space and by Xj the jth column vector of X. 



Proposition 1. Every solution {<j),p) of ^ satisfies: 

-p(X„y) + (X,-,X,^)+nAsgn(0,) = if 
\-p(Xj,Y) + {Xj,X4>)\ < nX if 

and 



h / o> 
h = 0' 



(y,x</.) + V(>',X(/.)2 + 4||y||2n 
Imp ■ 



3.2 £i-norm penalized MLE for mixture of Gaussian regressions 

Consider the mixture of Gaussian regressions model in Q. Assuming that p is large, 
we want to regularize the MLE. In the spirit of the approach in ([8]), we propose for the 
unknown parameter 9 = {(pi, . . . , 0^, /9i, . . . , /Ofe, vri, . . . , iTk-i) the estimator: 

^(^)=argmin-n-¥^j^^(0), (9) 
eee ^ ' 

k 

+ Xj27r7Ur\\i, (10) 

r=l 

@ = R>'P X M^o X n, (11) 
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where 11 = {tt; tt^ > for r = 1, . . . , A: — 1 and T^r < 1} with vr^ = 1 — J2t=i '^r- The 

value of 7 G {0, 1/2, 1} parameterizes three different penalties. 

The first penalty function with 7 = is independent of the component probabilities vr,.. 
As we will see in Sections 16.11 and | 6 . 4|, the optimization for computing 6^^^ is easiest, and 
we establish a rigorous result about numerical convergence of a generalized EM algorithm. 
The penalty with 7 = works fine if the components are not very unbalanced, i.e., the true 
TTr's aren't too different. In case of strongly unbalanced components, the penalties with 
values 7 G {1/2, 1} are to be preferred, at the price of having t o pursue a more d i fficult 
optimization problem. The value of 7 = 1 has been proposed by Khalili and Chen ( 200?! ) 



for the naively parameterized likelihood from model ([T]). We will report in Section 17.11 
about empirical comparisons with the three different penalties involving 7 € {0, 1/2, 1}. 

All three penalty functions involve the £i-norm of the component specific ratios (pr = ^ 
and hence small variances are penalized. The penalized criteria therefore stay finite when- 
ever cr,. — )■ 0: this is in sharp contrast to the unpenalized MLE w here the likelihood is 
unbounded if ar — s- 0; see, for example, iMcLachlan and Peel hood ). 



Proposition 2. Assume that Yi ^ for all i = l,...,n. Then the penalized negative 
log-likelihood — '^""'^^pen a(^) ^'^ hounded from below for all values G from 

A proof is given in Appendix [Cj Even though Proposition [2] is only stated and proved for 
the penalized negative log-likelihood with 7 = 0, we expect that the statement is also true 
for 7 = 1/2 or 1. 

Due to the ^i-norm penalty, the estimator is shrinking some of the components of 0i, . . . , 
exactly to zero, depending on the magnitude of the regularization parameter A. Thus, we 
can do variable selection as follows. Denote by 

S = {(r, j); 1 < r < fc, 1 < j < pjrj / o} . (12) 

Here, (pr,j is the jth coefficient of the estimated regression parameter (pj. belonging to 
mixture component r. The set S denotes the collection of non-zero estimated, i.e., selected, 
regression coefficients in the k mixture components. Note that no significance testing 
is involved, but, of course, S = depends on the specification of the regularization 
parameter A and the type of penalty described by 7. 



3.3 Adaptive £i-norm penalization 



A two -stage adaptive ^i-norm penalization for linear models has been proposed by IZou 



(|2006l ). called the adaptive Lasso. It is an effective way to address some bias problems of 



the (one-stage) Lasso which may employ strong shrinkage of coefficients corresponding to 
important variables. 

The two-stage adaptive ^i-norm penalized estimator for a mixture of Gaussian regressions 
is defined as follows. Consider an initial estimate for example, from the estimator 
in ([9]). The adaptive criterion to be minimized involves a re- weighted £i-norm penalty 
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term: 



i=l \r=l 
k p 

+ A^7r7^U'rj|(^rj|, 
r=l j=l 

If^r" I 



1 



If. 



^^eM-^iPrY^-Xf<Prf) 



,^fe-l), 



where 7 G {0, 1/2, 1}. The estimator is then defined as 



9Sipt:A = argmin-n-i4ip,(0), 
^ ' 6»ee ^ 



where is as in 



(13) 



(14) 



The adap tive Lasso in hnear mode l s has better variable sele ction properties than the 
Lasso, see Zou ( 20061 ). Huang et al. ( 20081 ). van de Geer et al. ([2010i ). We present some 
theory for the adaptive estimator in FMR models in Section [H Furthermore, we report 
some empirical results in Section 17.11 indicating that the two-stage adaptive method often 
outperforms the one-stage £i-penalized estimator. 



3.4 Selection of the tuning parameters 



The regularization parameters to be selected are the number of components fc, the penalty 
parameter A and we may also want to select the type of the penalty function, i.e., selection 
of 7. 



One possibility is to use a modified BIG criterion which minimizes 



BIG 



-2^(^g)+log(n)de 



(15) 



"{"1) 

over a grid of candidate values for k, A and maybe also 7. Here, ^j^^ denotes the estimator 
in ([9]) using the parameters X,k,^ in ([TO]) . and —£{•) is the negative log-likelihood. Fur- 
thermore, de = A; + (A;— l)+^^.^-i^ pr=i k^{4, .-/.Qj is the eSective numhev of pavametevs 
(|Pan and Shenl . l2007l ^ . 



Alternatively, we may use a cross-validation scheme for tuning parameter selection mini- 
mizing some cross-validated negative log-likelihood. 



Regarding the grid of candidate values for A, we consider < Ai < 
where Amax is given by 



Ar 



max 



V^\\Y\ 



< Am < Amax) 

(16) 



At A 

niax) s-ll coefficients cpj, (j — 1, ... ,p) of the one-component model are exactly zero. 
Equation (fTU|) easily follows from Proposition [H 
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For the adaptive £i-norm penalized estimator minimizing the criterion in (|13p . we proceed 
analogously but replacing 9^^l in (fT5]l by ^^dapt A ™ (fT4l) . As initial estimator in the 
adaptive criterion, we propose to use the estimate in Q which is optimally tuned using 
the modified BIC or some cross-validation scheme. 



4 Asymptotics for fixed p and k 



Following the penalized likelihood theory of iFan and Lil (1200111. we esta blish first some 
asymptotic properties of the estimator in (jlOp . As in iFan and Lil ((20011) , we assume in 
this section that the design is random and that the number of covariates p and the number 
of mixture components k are fixed as sample size n — )• cxd. Of course, this does not reflect 
a truly high-dimensional scenario, but the theory and methodology is much easier for this 
case. An extended theory for p potentially very large in relation to n is presented in 
Section O 

Denote by 6*0 the true parameter. 



Theorem 1. (Consistency) Consider model ^ with random design, fi xed p and k. If \ = 
0(n~^/^) (n — )• oo) then, under the regularity conditions (A)-(C) from Fan and l\ i '200i) 
on the joint density function of{Y,X), there exists a local minimizer 6^^^ ''Z ~^~^^pcn a(^) 
in f7 e {0, 1/2, 1}; such that 



i(7) 
A 



Op(l). 



A proof is given in Appendix [XI Theorem [T] can be easily misunderstood. It does not 
guarantee the existence of an asymptotically consistent sequence of estimates. The only 
claim is that a clairvoyant statistician (with pr e-knowledge of ^n) can choose a consistent 



sequence of roots in the neighborhood of (|van der Vaartl . 120071 ). In the case where 



-n 



-1^(7) 
'-pen, A 



{9) has a unique minimizer, which is the case for a FMR model with one 
component, the resulting estimator would be root-n consistent. But for a general FMR 
model with more than one component and typically several local minimizers, this does not 
hold anymore. In this sense, the preceding theorem might look better than it is. 

Next, we present an asymptotic oracle result in the spirit of iFan and l] (|2001I ) for the two- 
stage adaptive procedure described in Section [3^ Denote by S the population analogue of 
(112p . i.e., the set of non-zero regression coefficients. Furthermore, let Og = ({(/>rj; {r^j) G 
5}, pi, . . . , pfc, TTi, . . . , 7rfc_i) be the sub- vector of parameters corresponding to the true non- 
zero regression coefficients (denoted by S) and analogously for 6s- 



Theorem 2. (Asymptotic oracle result for adaptive procedure) 

Consider model (0) with random design, fixed p and k. If X = o(n~^/^), nX — )• oo and 
if satisfies 0™' — = Op(n~^/^), then, under the regularity conditions (A)-(C) from 
Fan and on the joint density function of {Y,X), there exists a local minimizer 

^idipt;A «/-"~'4ipt(^) ^nm) {0,1/2,1}; wMch satisfies: 
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1. Consistency in variable selection: IP[5'^dapt A = S] ^ 1 {n ^ oo). 

2. Oracle Property: (^idipt;A,5 " ^0,5) -^'^ M{0, IsiOo)'^), where /^(^o) «s the 
Fisher information knowing that 6s<^ = (i.e., the submatrix of the Fisher infor- 
mation at 9o corresponding to the variables in S). 

A proof is given in Appendix |Al As in Theorem [H the assertion of the theorem is only 
making a statement about some local optimum. Furthermore, this result only holds for 
the adaptive criterion with weights Wr j = nsrr coming from a root-n consistent initial 

estimator 9™^: this is a rather strong assumption given the fact that Theorem [T] only 
ensures existence of such an estimator. The non-adaptive estimator with the £i-norm 
penalty as in (jlOh cannot achieve sparsity a nd maintain root-n consi stency due to the bias 
problem mentioned in Section E3] (see also lKhahh and ChenI tooi )). 



5 General theory for high-dimensional setting with non- 
convex smooth loss 

We present here some theory, entirely different from Theorems [T] and O which reflects 
some consistency and optimality behavior of the ^i-norm penalized maximum likelihood 
estimator for the potentially high-dimensional framework with p ^ n. In particular, 
we derive some oracle inequality which is non-asymptotic. We intentionally present this 
theory for £i-penalized smooth likelihood problems which are generally non-convex; £1- 
penalized likelihood estimation in FMR models is then a special case discussed in Section 
15.31 The following Sections 15.11 - 15.21 introduce some mathematical conditions and derive 
auxiliary results and a general oracle inequality (Theorem [3]) ; the interpretation of these 
conditions and of the oracle result is discussed for the case of FMR models at the end of 
Section 15.3.11 



5.1 The setting and notation 

Let {/^; tp G ^} he a collection of densities with respect to the Lebesgue measure /i on 
M (i.e., the range for the response variable). The parameter space ^ is assumed to be a 
bounded subset of some finite-dimensional space, say 

where we have equipped (quite arbitrarily) the space with the sup-norm HV'lloo = 
maxi<j<£^ \ipj\. In our setup, the dimension d will be regarded as a fixed constant (which 
still covers high-dimensionality of the covariates, as we will see). Then, equivalent metrics 
are, e.g., the ones induced by the ^g-norm HV'llg = (Z]j=i I'^^jlgY^'^ (q ^ !)• 

We observe a covariate X in some space X gMP and a response variable y G M. The true 
conditional density of Y given X = x is assumed to be equal to 
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where 

^poix) G 'f, V x G Af. 

That is, we assume that the true conditional density of Y given x is depending on x only 
through some parameter function tpo{x). Of course, the introduced notation also applies 
to fixed instead of random covariates. 

The parameter {^0(2^); ^ G X} is assumed to have a nonparametric part of interest 
{goix)', X £ X} and a low-dimensional nuisance part 770! i-e-, 



V'o( 



with 



goix) G E^ V X G Af, r/o G M™, k + m = d. 



kT 



bj^x) and r] involves the parameters 



In case of FMR models, g{x)^ 

. . . , TTi, . . . , TTk-i- More details are given in Section 15.31 

With minus the log-likelihood as loss function, the so-called excess risk 

u 



£:(V|Vo 



log 



is the Kullback-Leibler information. For fixed covariates xi, . 
excess risk 



,Xn, we define the average 



and for random design, we take the expectation E[i?('(/'(X)|^o(^))]- 



5.1.1 The margin 



Following iTsvbakov (I2OO4I ) and Ivan de Geerl (120081 ) we call the behavior of the excess risk 
'^(V'l^o) near ^/^q the margin. We will show in Lemma[T]that the margin is quadratic. 

Denote by 

h = log /,/. 

the log-density. Assuming the derivatives exist, we define the score function 

dip ' 



and the Fisher information 

m 



dipdtlj'^ ' 

Of course, we can then also look at I{ip{x)) using the parameter function ijj{x). 

In the sequel, we introduce some conditions (Conditions 1-5). Their interpretation for 
the case of FMR models is given at the end of Section [5. 3. 11 First, we will assume bound- 
edness of third derivatives. 
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Condition 1 It holds that 

sup max 

where 



ip&^ (iij2j3)e{i,...,c(}s 



^3 



'hi' 



sup / G3{y)f^g{y\x)dfi{y) < Cs < oo. 



<G'3(- 



For a symmetric, positive semi-definite matrix A, we let A'^^^{A) be its smallest eigenvalue. 

Condition 2 For all x £ X , the Fisher information matrix I{'iPq{x)) is positive definite 
and, in fact, 

Amin = inf A„ii„(/(V'o(a;))) > 0. 



Further we will need the following identifiability condition. 
Condition 3 For all e > 0, there exists an > 0, such that 



mf mf 

||i/'-i/'o(a;)|l2>' 



Based on these three conditions, we have the following result: 
Lemma 1. Assume Conditions 1, 2, and 3. Then 

mr ,, , - "2 — „2 ' 



where 



1 dK'^ 



max 



£0 OisQ 



2(^3/2 ■ 



A proof is given in Appendix iBl 



5.1.2 The empirical process 

We now specialize to the case where 

= (5(xf,r/^), 

where (with some abuse of notation) 

aixf^ = 9,l>{xf = {gi{x),...,gk{x)), 

9r{x) = g^,{x) = x'^cpr, X £ W , (pr £ M^, r = 1, 



,k. 
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We also write 

to make the dependence of the parameter function ^p{x) on more explicit. 
We will assume that 

sup ||^"^x||oo = sup max {(pjxl < K. 

This can be viewed as a combined condition on X and <j). For example, if X is bounded 
by a fixed constant this supremum (for fixed 4>) is finite. 



Our parameter space is now 



={(t>{,...Ai. ); sup x\\oo < K, ||r?||oo < K}. 



(17) 



Note that is, in principle, (pk + m)-dimensional. The true parameter t^q is assumed to 
be an element of 0. 



Let us define 



L^{x, •) = log/^(^)(-), ipixy = ip^{xY' = {g4>{xf,V^), 



'd = {4>{ ,...,^^,77), and the empirical process for fixed covariates xi,.... 

Vn{^) = - ^(L^(xi,yi)-E[L^(a;i,y)|x = Xi]) . 



Xr^ 



We now fix some T > 1 and Aq > and define the event 



T 



sup 



^T=(^T_^T)gQ (110 - 0o||i + h - r/olb) V Ao 



<rAo 



(18) 



5.2 Oracle inequality for the Lasso for non-convex loss functions 

For an optimality result, we need some condition on the design. Denote the active set, 
i.e., the set of non-zero coefficients, by 



and let 
Further, let 



5 = {(r,j); 0r,,7^O}, s = \Sl 
= {^(rj); {r,j) ^ J}, J C {1, . . . , fc} X {1, . . . ,p}. 



1 " 

= — y. 



XiX^ 



i=l 
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Condition 4 (Restricted eigenvalue condition). There exists a constant k > 1 such that, 
for all (j) G satisfying 

ll^sHli < 6||05||i, 

it holds that 

k 

For ipi'V = idi')'^ jV'^)j use the notation 

^ n k m 
i=l r=l j=l 

We also write for g{-) = (fifi(-), • • .,gk{-)V, 

^ n k 
^ i=l r=l 

Thus 

k 
r=l 

and the bound in the restricted eigenvalue condition then reads 

Il'^5lli<'^'ll5<^ll^„. 

Bounding ||5<^||q^ in terms of ||i;^5||2 can be done directly using, e.g., the Cauchy-Schwarz 
inequality. The restricted eigenvalue condition ensures a bound in the other direction 
which itself is needed for an oracle inequality. Some references about the restricted eigen- 
value condition are provided at the end of Section 15.3.11 

We employ the Lasso-type estimator 

^ n k 

= {4>^,fi^)= argmin J2L^{xi,Yi) + Xj^Urh- (19) 

^T=(0T^^T)ge n .^^ 

We omit in the sequel the dependence of i? on A. Note that we consider here a global 
minimizer which may be difficult to compute if the empirical risk '^^^^ L^{xi,Yi) is 
non-convex in ■(?. We then write ||(/)||i = Ylr=i W't'rWi- We let 

i^ixf = {g^{xf,f), 
which depends only on the estimate i?, and we denote by 
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Theorem 3. (Oracle result for fixed design). Assume fixed covariates xi, . . . ,x„, Condi- 
tions 1-3 and 4, o,nd that A > 2TAo for the estimator in ( figj) with T and Aq as in ^W^. 
Then on T , defined in [W\). for the average excess risk (average Kullback-Leibler loss), 

£{tl^\i^Q) + 2{X-TXo)Us4i < 8iX + TXofclKh, 

where cq and k are defined in Lemma [I] and Condition 4, respectively. 

A proof is given in Appendix|Bj We will give an interpretation of this result in Section fS.S.H 
where we specialize to FMR models. In the case of FMR models, the probability of the 
set T is large as shown in detail by Lemma [3] below. 

Before specializing to FMR models, we present more general results for lower bounding 
the probability of the set T. We make the following assumption. 



Condition 5 For the score function s^( 



s^^(-), we have 



sup ||s^l(-)||oo < Gi( 



for some function Gi 



Condition 5 primarily has notational character. Later, in Lemma [2] and particularly in 
Lemma El the function Gi(-) needs to be sufficiently regular to ensure small corresponding 
probabilities. 



Define 



Ao = Mn log n 



log(p V n) 



n 



(20) 



As we will see, we usually choose M„ x Y^log(n). Let Px denote the conditional proba- 
bility given (Xi, . . . , X„) = (xi, . . . , Xn) = x, and with the expression 1{-} we denote the 
indicator function. 



Lemma 2. Assume Condition 5. We have for constants ci, C2 and C3 depending on k 
and K , and for all T > ci, 



sup 



with Px probability at least 

log^nlog(p V n) 



+ h - Voh) V Ac 



<TXo, 



1 — C2 exp 



-Y,FiY^>TXl/{dK) 



i=l 



where (for i = 1, . . . ,n) 

F{Yi) = Gi{Yi)l{Gi{Yi) > Mn} + E[Gi{Y)l{Gi(Y) > Af„} 

Regarding the constants Aq and K, see i20\) and (11), respectively. 
A proof is given in Appendix [Bl 



X = Xi 
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5.3 FMR models 



In the finite mixture of regressions model from ([2]) with k components, the parameter is 
i}T = {(f^.T^) = ((/)f,...,(/)^,logpi,...,logpfc,log7ri,...,log7rfe„i), where the Pr = cr"^ 
are the inverse standard deviations in mixture component r and the vr^ are the mixture 
coefficients. For mathematical convenience and simpler notation, we consider here the 
log-transformed p and tt parameters in order to have lower and upper bounds for p and 
TT. Obviously, there is a one-to-one correspondence between 'd and 6 from Section [2.1[ 

Let the parameter space be 

G C {??^;sup||</.^x||oo < i^,||log/)||oo </C,-K<log7ri <0,..., 

k-l 

-K<\og T^k-l < 0, < 1}, (21) 

r=l 

and TTfc = 1 - Ylr=l'^r- 

We consider the estimator 

n / k ^ \ k 

i?A = argmin-n-i^log K^7r,^exp(--(p,yi-Xf0,)2) + A ^ ||(/>,||i. (22) 
"fee i=i Vr=l ^ / r=l 

This is the estimator from Section 13.21 with 7 = 0. We emphasize the boundedness of 
the parameter space by using the notation G. In contrast to Section HI we focus here on 
any global minimizer of the penalized negative log-likelihood which is arguably difficult to 
compute. 

In the following, we transform the estimator to 6x in the parameterization 6 from 
Section [2.11 Using some abuse of notation we denote the average excess risk by £{9\\9q). 

5.3.1 Oracle result for FMR models 

We specialize now our results from Section [5.21 to FMR models. 

Proposition 3. For fixed design FMR models as in ^ with Q in {SP, Conditions 1,2 
and 3 are met, for appropriate C3, Amin and {a^}, depending on k and K. Also Condition 
5 holds with 

Gi{y) = e''\y\ + K. 

Proof. This follows from straightforward calculations. □ 

In order to show that the probability for the set T is large, we invoke Lemma [2] and the 
following result. 
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Lemma 3. For fixed design FMR models as in (0j with Q in \21\) . for some constants 
C4, C5 and cq, depending on k, and K, and for Mn = C/^^Jlogn and n > cg, the following 
holds: 



1 v-^ ^/^.x loff n \ 1 
-Y^F{Yi)>c^^] <-, 

\ i=l / 



X = Xi 



where (for i = 1, . . . ,n) 

F{Yi) = Gi{Yi)l{Gi{Yi) > Mn} + E[Gi{Y)l{Gi{Y) > M„} 

and Gi(-) is as in Proposition^^ 
A proof is given in Appendix [Bj 

Hence, the oracle result in Theorem [3] for our ^i-norm penalized estimator in the FMR 
model holds on a set T, summarized in Theorem HI and this set T has large probability 
due to Lemma [2] and Lemma [3] as described in the following corollary. 



Corollary 1. For fixed design FMR models as in with @ in [2l\). we have for constants 
C2,C4,C7,C8 depending on k and K, 

— n^^ for all n > cg, 



where T is defined with Aq = M„ log n^J '"g'-P^"-' and Mn = C4\/log n. 
Theorem 4. (Oracle result for FMR models). Consider a fixed design FMR model as 
in (0j with G in h21\). Assume Condition 4 (restricted eigenvalue condition) and that 
A > 2TAo for the estimator in Ii2^) . Then on T, which has large probability as stated in 
Corollary Ul for the average excess risk (average Kullback-Leibler loss), 

8{k\Oo) + 2(A - TAo)||(A5=||i < 8(A + TX^fcl^h, 

where cq and k are defined in Lemma [I] and Condition 4, respectively. 

The oracle inequality of Theorem H] has the following well-known interpretation. First, we 
obtain 

£{Ox\eo) < 8{X + TXofclKh. 

That is, the average Kullback-Leibler risk is of the order O(sAq) = 0{s log^ n log(pV n) /n) 
(take A = 2TXq, use definition (j20p and the assumption on M„ in Lemma [3] above) which 
is up to the factor log^ nlog(p V n) the optimal convergence rate if one would know the s 
non-zero coefficients. As a second implication we obtain 

||05-||i <4(A + rAo)c2K2s, 

saying that the noise components in S'^ have small estimated values (e.g., its £i-norm 
converges to zero at rate O(sAo)). 



[T]>l-C2 exp 



T log n log(p V n) 
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Note that the Conditions 1, 2, 3 and 5 hold automatically for FMR models, as described 
in Proposition [3l We do require a restricted eigenvalue condition on the design, here Con- 



dition 4. In fact, for the Lasso or Dantzig se . 
conditions (IKoltchinskiil. 



200S 



ence conditions (Bunea et al. 



ector in linear models, restricted eigenvalue 



Bickel et al.- 
20071: ICai etaV . 



20091^ a re considerably weaker than coher- 



, _ , _ . 2009a) or assuming the restricted isometry 

pro perty (ICande s and TaJ, '2OO5I; ^C ai et al\ l2009b ): for an overview among the relations. 



see 



van de Geer and Biihlmann (,20091 ) . 



5.3.2 High-dimensional consistency of FMR models 



We finally give a consistency result for FMR models under weaker conditions than the 
oracle result from Section [5.3. II Denote by 9q the true parameter vector in a FMR model. 
In contrast to Section HI the number of covariates p can grow with the number of obser- 
vations n. Therefore, also the true parameter ^0 depends on n. To guarantee consistency 
we have to assume some sparsity condition, i.e., the £i-norm of the true parameter can 

only grow with o{\J n/(log^ n log(p V n))). 

Theorem 5. (Consistency). Consider a fixed design FMR model ^ with Q in (2l\) and 
fixed k. Moreover, assume that 



WMi = ^\\4>o,rh = o(Y'n/(log^nlog(pVn))) (n 00). 

r=l 

If X = C\J\o^ n\og{p y n)/n for some C > sufficiently large, then any (global) mini- 
mizer 6\ as in i22\) satisfies 

£{ex\eo) = op{l) (n^oo). 



A proof is given in Appendix[Bj The (restricted eigenvalue) Conditio n 4 on the design is not 
requir ed; this is typical for a high-dimensional consistency result, see lGreenshtein and Ritov 
ibr the Lasso in linear models. 



6 Numerical optimization 

We present a generalized EM (GEM) algorithm for optimizing the criterion in (|10p in Sec- 
tion 16.11 In Section 16.21 and 16.31 we give further details on speeding-up and on initializing 
the algorithm. Finally, we discuss numerical convergence properties in Section 16.41 For 
the convex penalty (7 = 0) function, we prove convergence to a stationary point. 
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6.1 GEM algorithm for optimization 



Maximization of the log- likelihood of a mixture density is often done using the traditional 
EM algorithm of Dempster et al. ( 19771 ) . Consider the complete log-likelihood 



n k / \ 

IMY,A) = j;^A,,logf-^e-^(''^^-^"<^^)M+A,,log(vr,). 

i=l r=l 

Here (Aj^i, . . . , Aj^fc), i = 1, . . . ,n, are i.i.d. unobserved multinomial variables showing the 
component-membership of the ith observation in the FMR model: Aj^^ = 1 if observation i 
belongs to component r, and Aj = otherwise. The expected complete (scaled) negative 
log-likelihood is then 

Q(9\e') = -n-'Ee:[U9;Y,A)\Y], 
and the expected complete (scaled) penalized negative log-likelihood is 



Qpen{0\O') = Qi9\e') + XY,^: 



k 



r 1 1 1 1 1 • 

r=l 



The EM algorithm works by alternating between the E- and M-Step. Denote the param- 
eter value at EM-iteration m by 0^"^^ {m = 0, 1,2, . . .), where 9^^^ is a vector of starting 
values. 



E-Step: Compute Q{9\9^"^^), or equivalently for r = 1, . . . , /c and i = 1, 



,n 



Generalized M-Step: Improve Qpcn{9\9^'^'^) w.r.t 61 G 6. 
a) Improvement with respect to ir = (vri, . . . , vr^).' 
fix (p at the present value <j)^"^'^ and improve 

n k k 

-n-'Y^Yl + ^ E ""rUi""^ 111 (23) 

1=1 r=l r=l 

with respect to the probability simplex 

k 

{vr; TTr > for r = 1, . . . , A: and tt^. = 1}. 

r=l 

Denote by 7r^"^~^^^ = ^' which is a feasible point of the simplex. We propose to 
update TT as 

^(m+l) _ ^(m) _|_ ^(m)^-(m+l) _ ^M^j 
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where t^™") € (0,1]. In practice, t^™"^ is chosen to be the largest value in the grid 
{S'^; fc = 0, 1, 2, . . .} {0 < 6 < 1) such that (f23]) is not increased. In our examples, 
6 = 0.1 worked well. 

b) Coordinate descent improvement with respect to (p and p: 

A simple calculation shows that the M-Step decouples for each component into k 
distinct optimization problems of the form 

- log(p,) + ^\\prY - ±ct>rf + — Ur^^A' m\u r- = 1, . . . , fe (24) 

with 

n 

nr = ^7i,r, iYi,Xi) = y^%^(Yi,Xi), r = 1, . . . , fe. 

i=l 

Problem (plj) has the same form as dH); in particular, it is convex in {pr, 4>r,i, ■ ■ ■ , 4'r,p)- 
Instead of fully optimizing (I24p , we only minimize with respect to each of the coordi- 
nates, holding the other coordinates at their current value. Closed-form coordinate 
updates can easily be computed for each component r (r = 1, . . . , k) using Proposi- 
tion [U 



(^+1) ^ (y,x#^) + V(y,x4"^))^ + 4||y||V 

2||y||2 



,(m+l) 
r,j 



where Sj is defined as 



if \Sj\ < nX i TT: 



(m+l) 



IIX 112 



IIX, IP 



if Sj > n 



if 5, <-nA(^^+')^^ 



s<j 



s>j 



and j = 1, . . . ,p. 

Because we only improve <5pen(^|^''™'^) instead of a full minimization, see M-Step a) and 
b), this is a generalized EM (GEM) algorithm. We call it the block coordinate descent 
generalized EM algorithm (BCD-GEM); the word block refers to the fact that we are 
updating all components of tt at once. Its numerical properties are discussed in Section 

Ml 

Remark 1. For the convex penalty function with j = 0, a minimization with respect to 
TT in M-Step a) is achieved with vr^'""'"^) = ^' , i.e., using t^"*) = 1. Then, our M-Step 
corresponds to exact coordinate-wise minimization o/ (5pen(^|^'-™^)- 
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6.2 Active set algorithm 



There is a simple way to speed-up the algorithm described above. When updating the 
coordinates in the M-Step b), we restrict ourselves during every 10 EM-iterations to 
the current active set (the non-zero coordinates) and visit the remaining coordinates every 
11th EM- iteration to update the active set. In very high-dimensional and sparse settings, 
this leads to a remarkable dec r ease i n cor nputational t imes. A similar active set strategy 
is also used in Friedman et al. ( 2007 ) and Meier et al. ( 20081 ) . We illustrate in Section [73] 
the gain of speed when staying during every 10 EM-iterations within the active set. 



6.3 Initialization 

The algorithm of Section [6.11 requires the specification of starting values 6'^^\ We found 
empirically that the following initialization works well. For each observation i, i = 1, . . . ,n, 
draw randomly a class k G {1, . . . ,k}. Assign for observation i and the corresponding 
component n the weight ^{^^ = 0.9 and weights ji^r = 0.1 for all other components. Finally, 
normalize ji^r, r = 1, . . . ,k, to achieve that summing over the indices k yields the value 
one, to get the normalized values ji^r- Note that this can be viewed as an initialization of 
the E-Step. In the M-Step which follows afterwards, we update all coordinates from the 
initial values = 0, pi^^ = 2, iri^^ = 1/k, r = I, . . . ,k, j = 1, . . . ,p. 



6.4 Numerical convergence of the BCD-GEM algorithm 

We address here the convergence properties of the BCD-GEM algorithm described in 
Section 16.11 A detailed account of the converg ence properties of the EM algorithm in 
a general setting has been given by [Wd (Il983l ) . Under regularity conditions including 



differentiability and continuity, convergence to stationary points is proved for the EM 
algorithm. For the GEM algorithm, similar statements are true under conditions which 
are often hard to verify. 

As a GEM algorithm, our BCD-GEM algorithm has the descent property which means 
that the criterion function is reduced in each iteration: 

Since —n'^^^pcn a(^) ^® bounded from below (Proposition [2|), the following result holds. 

Proposition 4. For the BCD- GEM algorithm, —n^^^penX^^^"^'') decreases monotonically 
to some value £ > — oo. 

In Remark [H we noted that, for the convex penalty function with 7 = 0, the M-Step of 
the algorithm corresponds to exact coordinate-wise minimization of Qpen{(^\^^"^^)- In this 
case, convergence to a stationary point can be shown. 
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Theorem 6. Consider the BCD-GEM algorithm for the criterion function in ilO\) with 
7 = 0. Then, every cluster point 6 £ Q of the sequence {9^"^^; m = 0, 1, 2, . . .}, generated 
by the BCD-GEM algorithm, is a stationary point of the criterion function in ilO\) . 

A precise definition of a stationary point in a non-differentiable setup and a proof of the 
Theorem are given in Appendix [Cl The proof uses the crucial facts that QpenC^I^') is a 
convex function in and that it is strictly convex in each coordinate of 0. 



7 Simulations, real data example and computational tim- 
ings 

7.1 Simulations 

We consider four different simulation setups. Simulation scenario 1 compares the perfor- 
mance of the unpenalized MLE with our estimators from Section 13.21 (FMRLasso) and 
Section 13.31 (FMRAdapt) in a situation where the total number of noise covariate s grows 



successively. For computing the un penalized MLE, we used the R-package FlexMix (jLeisch 



2004 : Griin and Leischl . 2007 . 20081 ): Simulation 2 explores sparsity; Simulation 3 compares 



cross-validation and BIG; and Simulation 4 compares the different penalty functions with 
the parameters 7 = 0, 1/2, 1. For every setting, the results are based on 100 independent 
simulation runs. 

All simulations are based on Gaussian FMR models as in ([T|; the coefficients tTj., fir, o'r and 
the sample size n are specified below. The covariate X is generated from a multivariate 
normal distribution with mean and covariance structure as specified below. 

Unless otherwise specified, the penalty with 7 = 1 is used in all simulations. As explored 
empirically in Simulation 4, in case of balanced problems (approximately equal vr^), the 
FMRLasso performs similarly for all three penalties. In unbalanced situations, the best 
results are typically achieved with 7=1. In addition, unless otherwise specified, the true 
number of components k is assumed to be known. 

For all models, training-, validation- and test data are generated of equal size n. The esti- 
mators are computed on the training data, with the tuning parameter (e.g.. A) selected by 
minimizing twice the negative log-likelihood (log-likelihood loss) on the validation data. As 
performance measure, the predictive log-likelihood loss (twice the negative log-likelihood) 
of the selected model is computed on the test data. 

Regarding variable selection, we count a covariable X^^^ as selected if (3r,j 7^ for at least 
one r £ {1, . . . ,k}. To assess the performance of FMRLasso on recovering the sparsity 
structure, we report the number of truly selected covariates (True Positives) and falsely 
selected covariates (False Positives). 

Obviously, the performances depend on the signal-to-noise ratio (SNR) which we define 
for an FMR model as 

S^j^ _ Var(y) _ Eti Mf^J Cov{X)Pr + ol) 



Var(y|/3, = 0;r = l,...,A;) Er=i ^r^r^ 
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where the last equahty follows since E[X] = 0. 
7.1.1 Simulation 1 

We consider five different FMR models: Ml, M2, M3, M4 and M5. The parameters 
{TTr, /3r,(7r), the Sample size n of the training-, validation- and test-data, the correlation 
structure of covariates corr and the signal-to -noise ratio (SNR) are specified 

in Table [TJ Models Ml, M2, M3 and M5 have two components and five active covariates, 
whereas model M4 has three components and six active covariates. Ml, M2 and M3 differ 
only in their variances af, a"^ and hence have different signal-to- noise ratios. Model M5 
has a non-diagonal covariance structure. Furthermore, in model M5, the variances af, 
are tuned to achieve the same signal-to- noise ratio as in model Ml. 

We compare the performances of the maximum likelihood estimator (MLE) , the FMRLasso 
and the FMRAdapt in a situation where the number of noise covariates grows successively. 
For the models Ml, M2, M3, M5 with two components, we start with ptot = 5 (no noise 
covariates) and go up to ptot = 125 (120 noise covariates). For the three component model 
M4, we start with ptot = 6 (no noise covariates) and go up to ptot = 155 (149 noise 
covariates) . 

The boxplots in Figures [T] - [5] of the predictive log-likelihood loss, denoted by Error, the 
True Positives {TP) and the False Positives (FP) over 100 simulation runs summarize the 
results for the different models. We read off from these boxplots that the MLE performs 
very badly when we add noise covariates. On the other hand, our penalized estimators 
remain stable. For example, for Ml the MLE with ptot = 20 performs worse than the 
FMRLasso with ptot = 125, or for M4 the MLE with ptot = 10 performs worse than the 
FMRLasso with ptot = 75. Impressive is also the huge gain of the FMRAdapt method 
over FMRLasso in terms of log- likelihood loss and false positives. 





Ml 


M2 


M3 


M4 


M5 


n 


100 


100 


100 


150 


100 


Pi 


(3,3,3,3,3) 


(3,3,3,3,3) 


(3,3,3,3,3) 


(3,3,0,0,0,0) 


(3,3,3,3,3) 


h 


(-1,-1,-1,-1,-1) 


(-1,-1,-1,-1,-1) 


(-1,-1,-1,-1,-1) 


(0,0,-2,-2,0,0) 


(-1,-1,-1,-1,-1) 


^3 








(0,0,0,0,-3,2) 




a 


0.5, 0.5 


1, 1 


1.5, 1.5 


0.5, 0.5, 0.5 


0.95, 0.95 


7T 


0.5, 0.5 


0.5, 0.5 


0.5, 0.5 


1/3, 1/3, 1/3 


0.5, 0.5 


corr(X«,X('")) 










g g|!-m| 


SNR 


101 


26 


12.1 


53 


101 



Table 1: Models for simulation 1. 5i^rn denotes Kronecker's delta. 



7.1.2 Simulation 2 

In this section, we explore the sparsity properties of the FMRLasso. The model speci- 
fications are given in Table [2j Consider the ratios pact : n- '■ Ptot- The total number of 
covariates ptot grows faster than the number of observations n and the number of active 
covariates pact: when ptot is doubled, pact is raised by one and n is raised by 50 from 
model to model. In particular, we obtain a series of models which gets "sparser" as n 
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Figure 1: Simulation 1, Model Ml. Top: Predictive log-likelihood loss (Error) for MLE, 
FMRLasso, FMRAdapt. Bottom: False Positives (FP) and True Positives (TP) for FM- 
RLasso and FMRAdapt. 

grows (larger ratio n/pact)- In order to compare the performance of the FMRLasso, we 
report the True Positive Rate (TPR) and the False Positive Rate (FPR) defined as: 

#truly selected covariates 
^active covariates ' 
T^falsely selected covariates 
T^inactive covariates 

These numbers are reported in Figure [H We see that the False Positive Rate approaches 
zero for sparser models, indicating that the FMRLasso recovers the true model better in 
sparser settings regardless of the large number of noise covariates. 



TPR = 
FPR = 



7.1.3 Simulation 3 



So far, we regarded the number k of components as given, while we have chosen an 
optimal Aopt by minimizing the negative log-likelihood loss on validation data. In this 
section, we compare the performance of 10-fold cross-validation and the BIC criterion 
presented in Section 13.41 for selecting the tuning parameters k and A. We use model Ml 
of Section [7.1.11 with ptot = 25,50,75. For each of these models, we tune the FMRLasso 
estimator according to the following strategies: 
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320 
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(3, 3, 3, 0, 0, ...) 
















(-1,-1,-1,0, 0, ...) 








a 








0.5, 0.5 








TT 








0.5, 0.5 









Table 2: Series of models for simulation 2 which gets "sparser" as n grows: when ptot is 
doubled, pact is raised by one and n is raised by 50 from model to model. 



3:50:10 4:100:20 5:150:40 6:200:80 7:250:160 8:300:320 9:350:640 
Pact : n : Plot 



3:50:10 4:100:20 5:150:40 6:200:80 7:250:160 8:300:320 9:350:640 
Pact : n : Ptot 



Figure 6: Simulation 2 compares the performance of the FMRLasso for a series of models 
which gets "sparser" as the sample size grows. Top: True Positive Rate {TPR). Bottom: 
False Positive Rate (FPR) over 100 simulation runs. 

(1) Assume the number of components is given (k = 2). Choose the optimal tuning 
parameter Aopt using 10-fold cross-validation. 

(2) Assume the number of components is given (k = 2). Choose Aopt by minimizing the 
BIC criterion (fT5]) . 

(3) Choose the number of components k G {1,2,3} and Aopt by minimizing the BIC 
criterion (jl5p . 



27 



The results of this simulation are presented in Figure[71 where boxplots of the log- likelihood 
loss (Error) are shown. All three strategies perform equally well. With ptot = 25 the BIG 
criterion in strategy (3) always chooses k = 2. For the model with ptot = 50, strategy (3) 
chooses A; = 2 in 98 simulation runs and k = 3 in two runs. Finally, with ptot = 75, the 
third strategy chooses fc = 2 in 92 runs and A; = 3 eight times. 

7.1.4 Simulation 4 

In the preceding simulations, we always used the value 7 = 1 in the penalty term of 
the FMRLasso estimator ()10p . In this section, we compare the FMRLasso for different 
values 7 = 0,1/2,1. First, we compute the FMRLasso for 7 = 0,1/2,1 on model Ml 
of Section 17.1.11 with ptot = 50. Then we do the same calculations for an "unbalanced" 
version of this model with tti = 0.3 and tt2 = 0.7. 

In Figure [HI the boxplots of the log-likelihood loss (Error), the False Positives (FP) and 
the True Positives (TP) over 100 simulation runs are shown. We see that the FMRLasso 
performs similarly for 7 = 0, 1/2, 1. Nevertheless, the value 7 = 1 is slightly preferable in 
the "unbalanced" setup. 



M1 : p,„, =25 M1 : p,„, =50 M1 : p,„, =75 




Figure 7: Simulation 3 compares different strategies for choosing the tuning parameters 
k and A. The boxplots show the predictive log-likelihood loss (Error) of the FMRLasso, 
tuned by strategies (1), (2) and (3), for model Ml with ptot = 25, 50, 75. 
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7^0 Y=1/2 7^1 
Error 



7^0 Y=1/2 7^1 



7^0 Y=1/2 7^1 
FP 



7^0 Y=1/2 7^1 
TP 



7^0 Y=1/2 7^1 



7^0 Y=1/2 7^1 



Figure 8: Simulation 4 compares the FMRLasso for different values 7 = 0, 1/2, 1. The up- 
per row of panels shows the boxplots of the log- likelihood loss (Error), the False Positives 
(FP) and the True Positives (TP) for model Ml with ptot = 50 and tti = 7r2 = 0.5. The 
lower row of panels shows the same boxplots for an "unbalanced" version of model Ml 
with TTi = 0.3 and 7r2 = 0.7. 
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7.2 Real data example 



We now apply the FMRLasso to a dataset of riboflavin (vitamin B2) production by Bacillus 
Subtilis. The real-valued response variable is the logarithm of the riboflavin production 
rate. The data has been kindly provided by DSM (Switzerland). There are p = 4088 
covariates (genes) measuring the logarithm of the expression level of 4088 genes and mea- 
surements of n = 146 genetically engineered mutants of Bacillus Subtilis. The population 
seems to be rather heterogeneous as there are different strains of Bacillus Subtilis which 
are cultured under different fermentation conditions. We do not know the different homo- 
geneity subgroups. For this reason, a FMR model with more than one component might 
be more appropriate than a simple linear regression model. 

We compute the FMRLasso estimator for k = 1, . . . , 5 components. To keep the computa- 
tional effort reasonable, we use only the 100 covariates (genes) exhibiting the highest em- 
pirical variances. We choose the optimal tuning parameter Aopt by 10-fold cross-validation 
(using the log-likelihood loss). As a result, we get flve different estimators which we com- 
pare according to their cross- validated log-likelihood loss {CV Error). These numbers are 
plotted in Figure [9l The estimator with three components performs clearly best, resulting 
in a 17% improvement in prediction over a (non-mixture) linear model, and it selects 51 
genes. In Figure \TU[ the coefficients of the 20 most important genes, ordered according 
to Ylr=i \^r,j\, are shown. From the important variables, only gene 83 shows the opposite 
sign of the estimated regression coefficients among the three different mixture components. 
However, it happens that some covariates (genes) exhibit a strong effect in one or two mix- 
ture components but none in the remaining other components. Finally, for comparison, 
the one-component (non- mixture) model selects 26 genes, of which 24 are also selected in 
the three-component model. 



CV Error 



2 3 4 

number of components 



Figure 9: Riboflavin production data. Cross- validated negative log-likelihood loss {CV 
Error) for the FMRLasso estimator when varying over different numbers of components. 
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Figure 10: Riboflavin production data. Coeflicients of the 20 most important genes, 
ordered according to Ylr=i \i^r,j\, for the prediction optimal model with three components. 



7.3 Computational timings 



In this section, we report on the run times of the BCD-GEM algorithm on two high- 
dimensional examples. In particular, we focus on the substantial gain of speed achieved 
by using the active set version of the algorithm described in Section 16.21 All computations 
were carried out with the statistical computing language and environment R. Timings 
depend on the stopping criterion used in the algorithm. We stop the algorithm if the 
relative function improvement and the relative change of the parameter vector are small 
enough, i.e.. 



(7) 
pen, A ^ 



l + \i- 



-1))| 



< T, 




with r = 10"^. 



We consider a high-dimensional version of the two component model Ml from Section lY.l.ll 
with n = 200, ptot = 1000 and the riboflavin dataset from Section 17.21 with three compo- 
nents, n = 146 and ptot = 100. We use the BCD-GEM algorithm with and without active 
set strategy to flt the FMRLasso on a small grid of eight values for A. The corresponding 
BIG, CPU times (in seconds) and number of EM-iterations are reported in Tables [3] and 
m The values for the BCD-GEM without active set strategy are written in brackets. For 
model Ml and an appropriate A with minimal BIG score, the active set algorithm con- 
verges in 5.96 seconds whereas the standard BCD-GEM needs 53.15 seconds. There is 
also a considerable gain of speed for the real data: 0.89 seconds versus 3.57 seconds for A 
with optimal BIG. Note that in Table El the BIG scores sometimes differ substantially for 
inappropriate values of A. For such regularization parameters, the solutions are unstable 
and different local optima are attained depending on the algorithm used. However, if the 
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regularization parameter is in a reasonable range with low BIC score, the results stabilize. 



A 


10.0 15.6 21.1 26.7 32.2 37.8 43.3 48.9 


BIC 
CPU [s] 
# EM-iter. 


2033 (2022) 1606 (1748) 951 (959) 941 (940) 989 (983) 1236 (1073) 1214 (1216) 1206 (1203) 
26.78 (269.91) 17.05 (165.78) 8.63 (82.78) 5.96 (53.15) 5.08 (44.23) 4.23 (37.27) 3.35 (18.99) 3.30 (15.62) 
277.0 (341.5) 196.0 (205.0) 96.0 (100.5) 63.5 (64.5) 56.0 (53.5) 41.5 (46.0) 31.5 (23.0) 25.0 (19.0) 



Table 3: Model Ml with n = 200 and ptot = 1000. Median over 10 simulation runs of BIC, 
CPU times and number of EM-iterations for the BCD-GEM with and without active set 
strategy (the latter in brackets). 



A 


3.0 


13.8 


24.6 


35.4 46.2 


57.0 


67.8 


78.6 


BIC 


560 (628) 


536 (530) 


516 (522) 


532 (525) .541 (.540) 


561 (580) 


592 (591) 


611 (613) 


CPTT [si 


22.10 (29.98) 


1.35 (3.28) 


0.89 (3.57) 


0.86 (3.31) 0.78 (3.87) 


0.69 (2.12) 


0.37 (2.56) 


0.85 (4.05) 


# EAl-iu"-!-. 


:i'i8<) (2()T<S) 


:-!i5 (239) 


2S7 (2()()) 


2I),S (217) 29() (290) 


2 IS (181) 


129 (192) 


313 (302) 



Tabic 4: Riboflavin data with k = 3, n = 146 and ptot = 100- BIC, CPU times and 
number of EM-iterations for the BCD-GEM with and without active set strategy (the 
latter in brackets). 



8 Discussion 

We have presented an .^i-penalized estimator for a finite mixture of high-dimensional 
Gaussian regressions where the number of covariates may greatly exceed sample size. 
Such a model and the corresponding Lasso-type estimator are useful to blindly account 
for often encountered inhomogeneity of high-dimensional data. On a high-dimensional real 
data example, we demonstrate a 17% gain in prediction accuracy over a (non-mixture) 
linear model. 

The computation and mathematical analysis in such a high-dimensional mixture model 
is challenging due to the non-convex behavior of the negative log-likelihood. Moreover, 
with high-dimensional estimation defined via optimization of a non-convex objective func- 
tion, there is a major gap between the actual computation and the procedure analyzed 
in theory. Wc do not provide an answer to this issue in this thesis. Regarding the 
computation in FMR models, a simple reparameterization is very beneficial and the ii- 
penalty term makes the optimization problem numerically much better behaved. We 
develop an efficient generalized EM algorithm and we prove its numerical convergence to 
a stationary point. Regarding the statistical properties, besides standard low-dimensional 
asymptotics, we present a non-asymptotic oracle inequality for the Lasso-type estimator 
in a high-dimensional setting with general, non-convex but smooth loss functions. The 
mathematical arguments are different than what is typically used for convex losses. 
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Appendices 



A Proofs for Section [4] 



A.l Proof of Theorem [T] 



We assume the reg ularity assumption s (A)-(C) of Fan and Li ( 200ll ). The theorem follows 
from Theorem 1 of Fan and Li ( 200ll ). □ 



A. 2 Proof of Theorem I 



In order to keep the notation simple, we give the proof for a two class mixture with k = 2. 
All arguments in the proof can also be used for a general mixture with more than two 
components. Remember that — n~^£adapt(^) is given by 



-n-i4daptW = -n'^^(^) + A(7r7^«;ij|</>i,,-| + (l 



VTi 



p 



where i{0) is the log-likelihood function. The weights Wrj are given by Wrj 
r = 1,2, and j = 1, . . . ,p. 



Assertion 1. 

Let 9 be a root-n consistent local minimizer of — ?^~^^adapt(^) (construction as in lFan and Li 

For all {r,j) G S, we easily see from consistency of 9 that ¥[{r,j) G S"] — t- 1. It then remains 
to show that for all {r,j) E S"^, P[(r, j) E S"^] — )• 1. Assume the contrary, i.e., w.l.o.g there 
is an s E {1, . . . ,p} with (pi^g = such that </>i^s ^ with non- vanishing probability. 



By Taylor's theorem, applied to the function n ^ 
on the line segment between 9q and 9 such that 



there exists a (random) vector 9 



1 dl 



adapt 



n 




'o + 



6*0) - A7r7'Wi,sSgn(0i,5 



Now, using the regularity assumptions and the central limit theorem, term (1) is of order 
Op{-^). Similarly, term (2) is of order Op{l) by the law of large numbers. Term (3) 
is of order Op(l) by the law of large numbers and the regularity condition on the third 
derivatives (condition (C) of lFan and Lil (|200ll )). Therefore, we have 



1 , 

n 



"-adapt 



Op(^) + (Op(l) + {9- 9oY Op{l)){9 - 9o) 
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As 6 is root-n consistent we get 

1 C^^adapt 



n 



Op(^) + (Op(l) + op(l)Op(l)) Op(^) - A7r7w;i,,sgn((^i,,) 



1 



Op(l) 



nA 



7rj;t(;i,sSgn(0i,s^ 



/n \ ^/n 
Prom the assumption on the initial estimator, we have 



(26) 



nA 



nA nA 



n ^\<p\-l\ Op(l) 



oo 



as 



nA — )• oo. 



Therefore, the second term in the brackets of (|26p dominates the first and the probabiUty 
of the event 

adapt 



194 

sgn I - - 



n dcf) 



l,s 



-sgn((/>i,s) ^ 



tends to 1. But this contradicts the assumption that is a local minimizer (i.e., ^-^^^ 
Assertion 2. 

Write 6 = {Os,Os'')- From part 1), it follows that with probability tending to one ^5 is a 
root-n local minimizer of — n~^^adapt (^5,0). By using a Taylor expansion we find, 



0). 







1./ 



n 



'adapt 1 9g 



1 



'-~^'\do s + ~^"\do s 
n ' n 



(1) 



vri'u;i,5Sgn(0i,5) 

(1 - 7ri)'^u;2,5Sgn((^2,s) 



i^T^ E ''^ijl'^ijl - 7(1 - Ti-i)'^"^ E ^2 



(ij)e5 



(2j)G5 



Now term (1) is of order —Is{Go) + op(l) (law of large numbers); term (2) is of order 
op(l) (consistency); and term (3), with some abuse of notation an {\S\ + 3)-vector of 
{\S\ + 3) X (15"! + 3) matrices, is of order Op(l) (law of large numbers and regularity 
condition on the third derivatives). Therefore, we have 

V^-^'W s + (-^5(^0) + op{l)) V^{9s - do,s) - V^XOp{l) = 0, 
n 



or 



i-Isieo) + op(l)) V^iOs - 00,5) - V^AOp(l) 



n 



(27) 



Notice that —r= 



1 fi'i 



i^os -^{^1 ^si&o)) by the central limit theorem. Furthermore, \/nX 
0(1) as A = o{n~^^'^). Therefore, 



follows from Equation (j27p . 



□ 
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B Proofs for Section [5] 



B.l Proof of Lemma [T] 

Using a Taylor expansion, 



where 



sup max 
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< 



Hence 



mMx)) > U-M^)\\l^liJ2-d^^^C3U-Mx)\\l/6- 

Now, apply the auxiliary lemma below, with Kq = dK'^, = and 
C = d^l^C^/Q. 

Auxiliary Lemma. Let h : [—Kq.,Kq\ — )• [0, oo) have the following properties: 

(i) y e > d 3 Ue > Q such that infg<|2|<x„ h{z) > a^, 

(ii) 3 A > 0, C > 0, such that V \z\ < Kq, h{z) > b^z^ - C\zf . 
Then V \z\ < Kq, 

h{z) > z^lCl, 

where 



Cn = max 



1 M 

£0 "eo 



A2 
2C' 



Proof (Auxiliary Lemma) 

If eo > Kq, we have h{z) > A^z^/2 for ah \z\ < Kq. 

If eo < Kq and \z\ < Sq, we also have h{z) > (A^ - eoC)z'^ > A'^z'^/2. 

If eo ^ Kq and eo < \z\ < Kq, we have h{z) > = KqOs^/Kq > \z\'^asg/KQ 



B.2 Proof of Lemma [2] 

In order to prove Lemma O we first state and proof a suitable entropy bound: 
We introduce the norm 



IW'.OIIp. 



1 " 



i=l 
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For a collection H of functions on X x y, we let H^-^T-L, \\ ■ \\p„) be the entropy of % 
equipped with t he metric induced b y the norm || • \\p^ (for a definition of the entropy of a 
metric space see van de Geei ( 200d )). 



Define for e > 0, 



Entropy Lemma For a constant Cq depending on k and m, we have for all u > and 
M„ > 0, 



h{uA{L^- L^.)1{Gi < M„} : i9 e G(e; 



• ||P„ < Co 2- log 



u 



Proof (Entropy Lemma) We have 



|L^(x,y)-L.(x,2/)|2 < Gliy) 



< dGliy) 



r=l 
k 



1 2 



x\ + \\r] — fj\\i 



\T |2 , II ~||2 

) x\ + \\r] - r/||2 



r=l 



It follows that 



||(L^-L^)l{Gi<M„}|||,„<dAf2 



k ^ n 
r=l 1=1 



)^Xi|^ + \\r] - fjWl 



Let N(-,T, t) denote the covering number of a metric space (F, r) with metric (induced by 
the norm) r, and H{-,T,t) = log iV(-, F, r) be it s entropy (for a definition of the covering 
number of a metric space see Ivan de Geerl toodi )). If F is a ball with radius e in Euclidean 
space M^, one easily verifies that 

H{u,T,t) < iVlog ,y u>0. 

Thus H{u, [r] G : lh ~ ^n jh < e), || • lb) < mlog (^) , \/u > 0. Moreover, applying a 
bound as in Lemma 2.6.11 of van der Vaart and Wellner ( 19961 ) gives 



H 



(2u.{±( 



'0,rfxr : < ehl • \\Pn <^ + l log(l + M 



We can therefore conclude that 



H(^3VdMnU, |(L^ - L^,)l{Gi < Mn} : ^ G G(e)|, 



< I ^ + m + l\ Mog ( ^ ) + log(l + kp) 
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Let's now turn to the main proof of Lemma [2j 

In what follows, {q} are constants depending on Amax, k, m and K. The truncated version 
of the empirical process is 



1 



i=l 



= - E L^{xi,Yi)\{Gi{Yi) < M„} - E L^{xi,Y)\{Gi{Y) < M„} 



X = Xi 



Let e > be arbitrary. We invoke Lemma 3.2 in Ivan combined with a 

symmetrization lemma (e.g., a conditional version of Lemma 3.3 in Ivan 
We apply these lemmas to the class 



(L^ - L^JHGi < M„} : 7? G e(e) 
In the notation used in Lemma 3.2 of van de Geeil (|200d l. we take 



S = c^TeMn logniyiog(p V n)/n, and R = c^{e A 1)M„. This then gives 



X, sup >C6erM„logn 
v,?ee(e) 



log(p V n) 



n 



< C7 exp 



log^ n \og{p V n) (e^ V 1) 



Here, we use the bound (for < a < 1), 



'^/log(^)<iu<log3/2Ql 



We then invoke the peeling device: split the set into sets 

{t? E e : 2-(^^+i) < \\4> - Ml + h - ?7o||2 < 2-^'}, 

where j € Z, and 2^-'^^ > Aq- There are no more than cglogn indices j < with 
2~-5'+^ > Aq. Hence, we get 



sup 



with Px probability at least 



|i + Wv - 11*112) V Ao 



< 2cQTMn log n 



log{p V n) 



n 



1 — C7 [cg log n] exp 



log^ n log(p V n) 



00 

E C7 exp 



r222J log2nlog(p Vn) 



> 1 — C2 exp 



log^ n log(p V n) 



-10 



Finally, to remove the truncation, we use 

\{L4x,y)-L^,{x,y))l{Gi{y) > M„}| < dKGi{y)\{Gi{y) > M„}. 
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Hence 



0l|l + ||??-??1|2) VAo 



dK 



< ^^(Gi(y,)l{Gi(yi) >M4 + E Gi{Y)l{Gi{Y) > Mn} 

^ i=l 



X = Xi 



□ 



B.3 Proof of Theorem [3] 

On T 

£{i,\^l^o) + XUh<TXo 
By Lemma [U 

and £{ipo\ipo) = 0. 
Case 1 Suppose that 

Then we find 



4>o\\i + - mh) V Ao 



+ A||</.o||i+^(Vo|V'o) 



1 + 



I2 < Ao 



Case 2 Suppose that 
and that 
Then we get 



So then 

Case 3 Suppose that 
and that 



£{i^m < TXl + A||,^ - M\i + '^(V'olV'o) 

< (A + rAo)Ao. 

llfA - </'ol|i + - mh > ^0, 
TXoWfi - mh > (A + rAo)||<^s - {4>o)s\\i- 



£{i>\^lJo) + (A - TXo)Us4i < 2rAo||r/ - ^Ib 
<4T^Xlcl + \\f,-r,o\\l/{2cl) 
< iT^Xlcl + £iiP\^o)/2. 

£{i>\i^o) + 2(A - rAo)||<^5^||i < ST^Xlcl 



|i + II?? - %l|2 > Ao, 
TAoll?? - ??o||2 < (A + rAo)||<^s - iMsWi- 
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Then we have 

£{^\^o) + (A - rAo)||05-||i < 2(A + rAo)||<^5 - 
So then 

ll'^S'^lll < 6||05 - (0o)5||l- 
We can then apply the restricted eigenvalue condition to (/> — </>o. This gives 



e{il;\i;o) + {\-T\Q)UsA\i < 2(A + ^Ao)^/i||</)s - <Po||2. 

< 2{\ + T\Q)^/sK\\g - qqWq^ 

< 4{X + TXofcys + £{i^\tl;o)/2. 

So we arrive at 

£{iP\^o) + 2{X-TXo)Us4i < 8{X + TXofclKh. 



□ 



B.4 Proof of Lemma [3] 

Let Z he a standard normal random variable. Then by straightforward computations, for 
all M > 0, 

£:[|Z|1{|Z| > M}] < 2exp[-MV2], 

and 

£[\Z\WZ\ > M}] < (M + 2) exp[-MV2]. 



Thus, for n independent copies Zi, . . . , Zn of Z , and M = 2-v/logn, 

f('-±\zm\z.\>m}>'-^] 

< P f i V \Z,\l{\Zi\ > M} - £[\Z\l{\Z\ > A/}] > ) 

^ n£[\Z\H{\Z\ > M}] ^ 2 
~ 4(logn)^ ~ n 

The result follows from this, as 

Gi{Y) = e^\Y\ + K, 

and Y has a normal mixture distribution. □ 
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B.5 Proof of Theorem [5] 

On T, defined in ([18]) witli Aq = C4A/log'^ nlog(p V n)/n (04 as in Lemma[3l i.e., M„ 



A||</.o||i + ^(Vo|V'o). 



C4y^log(n) in (|20|) ). we have the basic inequahty 



^(V'|V'o) + A||(/<||i <rAo 



^olli + 11^7 - mh) V Ao 



Note that ||?? — ?7o||2 ^ 2i<r and £{'iPq\'iPq) = 0. Hence, for n sufficiently large, 

£{i^\iljo) + Xmi < rAo(||<^-</)o||i + 2i^) + A||0o||i+^(^o|V'o) 

< TAo(||<^||i + ll^olli + 2K) + XUoWi + ^(VolV'o), 

and therefore also 

^(V^lVo) + (A - rAo)||<^||i < rAo2K + (A + TAo)||-/'o||i +^(V'o|V'o). 



It holds that A > 2rAo (since A = C^/log nlog(pVn)/n for some C > sufficiently 



large), Aq = ©(y log'^ n log(p V n)/n) and A = 0( y log^ nlog(p V n)/n), and due to the 

assumption about \\4>o\\i we obtain on the set T that £{ip\ipo) — )• £{^o\ipo) = (n — )• 00). 
Finally, the set T has large probability, as shown by Lemma [2] and using Proposition [3] 
and Lemma El for FMR models. □ 



C Proofs for Sections [3] and [6] 
C.l Proof of Proposition [2] 

We restrict ourselves to a two class mixture with k = 2. Consider the function n(^) defined 



as 



tx(0=exp(^W(0) 



i=l 



{Y,-Xjp2?\\ /-A||/3l||l^ _ /-AII/32II1 
2 



2^1 )r''^\-^^)'''^\ a. 



We will show that u(^) is bounded from above for ^ = (/3i, /32, (T2, vri) G H = M x 

M?^o ^ [0' Then, clearly, — n~^^pen(0) is bounded from below for 6 = (^i, (1)2, Pi, P2, tti) G 
e = M2p X X (0,1). 

The critical point for unboundedness is if we choose for an arbitrary sample point i G 
{1, . . . ,n} a j3l such that 1^ — Xj (31 = and let cJi — )• 0. Without the penalty term 

II R* II 

exp(— A ) in (|28|) the function would tend to infinity as cJi — t- 0. But as 7^ for all 
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z G {1, . . . ,n}, PI cannot be zero, and therefore exp(— A^^^^) forces n(^) to tend to as 
fji ^ 0. 

Let us give a more formal proof for the boundedness of u(^). Choose a small < ei < 
minjl^'^} and £2 > 0. As Yi ^ 0, i = 1, ... ,n, there exists a small constant m > such 

i 

that 

< min{y,2} -ei< (Yi - Xj ^^f (29) 
holds for all i = 1, . . . , n as long as ||/3i ||i < m, and 

< min{y,2} _ < (y. _ xfh? (30) 

holds for alH = 1, . . . , n as long as ||/32||i < m. 
Furthermore, there exists a small constant 5 > such that 



— exp ( — (min{yj^} — ei)/2a'l) < £2 and — exp \— — ) < £2 



(31) 



hold for all < cri < 5, and 



— exp i — (min{yj^} — e\)l2a'^ < £2 and — exp [ — ^ ) < £2 (32) 
0"2 V j / cr2 V ^2 / 

hold for all < (T2 < 5. 

Define the set K = { (/3i , /32 , o"i , (T2 , vri ) £ H; 5 < ai, a2}. Now u(^) is trivially bounded on 
K. From the construction of K and Equations (j29l )-(f32l). we easily see that u(i^) is also 
bounded on K'^, and therefore bounded on S. □ 

C.2 Proof of Theorem [6] 

The density of the complete data is given by 

UY, Ai^) = n n i^Tt''"^'^'^^'''^'^^^^ 

i=l r=l ^ ^ 

whereas the density of the observed data is given by 

n k 



1=1 r=l ^ 



e = (pi,...,pfc,(^i,i,(/)i,2,...,(Afe,p,vr) e G, e = M^o X ^^'^ X n C 

with 

k-l k-1 

n = {vr = (vTi, . . . , 7rfe_i); VTr > for r = 1, . . . , A; — 1 and vr^ < 1}, vr^ = 1 — vr^ 

r=l r=l 
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Furthermore, the conditional density of the complete data given the observed data is given 
by k{Y, A\Y, 6) = fc{Y, A\e)/fohs{Y\e). Then, the penalized negative log-likelihood fulfills 
the equation 



-n 



-n 



'^\ogU,{Y\e) + xY,\ 



1 



(33) 



= Qpen{o\e') - H[e\e') 

where Q^eMO') = -n-^Ee^log /e(y, A|0)|y] + A^ti 

H{e\e') = -n-%/[iogfc(y,A|y,0)|y]. 



r=l 



II 



(compare Section WT) and 



By Jensen's inequality, we get the following important relationship 

H{e\9') > H{e'\e') v ^ g e, 



(34) 



see also Wu ( 19831 ) . Qpcn(^|^') and H{6\9') are continuous functions in 9 and 6' . If we think 
of them as functions of with fixed 9\ we write also Qpen,e'{9) and He' (9). Furthermore, 
Qpen,0'{9) is a convex function of 9 and strictly convex in each coordinate of 9. As a last 
pre paration, we g ive a definition of a stationary point for non-differentiable functions (see 
alsolTsengl fioolf)): 

Definition 1. Let u be a function defined on an open set U C ]Rfc(p+2)-i. A point xeU 
is called stationary if 

u'{x- d) = hm + - > Vd G M^(^+2)-\ 
a\0 a 



We are now ready to start with the proof which is inspired by iBertsekasI (| 19951 ) . We write 

9 = (6*1, ... ,6*1)) = (pi, . . . ,pfc,(/'i,i,(/'i,2, • • . ,</'fc,p,vr), 

where D = k + kp + 1 denotes the number of coordinates. Remark that the first D — \ 
coordinates are univariate, whereas 0£) = vr is a "block coordinate" of dimension k — 1. 

Proof Let 9"-" = 6l(™) be the sequence generated by the BCD-GEM algorithm. We need 
to prove that for a converging subsequence 9"^i 9 £ @, 9 is a stationary point of fpen(^)- 



Taking directional derivatives of Equation (|33p yields 

i^'p,^{9;d) = Q'^,,/9;d) - {V H-,{9),d). 

Note that V Hq{9) = as Hg{x) is minimized for x = 9 (Equation (I34p ). Therefore, it 
remains to show that Q'^^^ g{9; d) >0 for all directions d. Let 

Using the definition of the algorithm, we have 

<3pen,e™(^™) > Qpcn,e'"(^D > " " " > <5pcn,e™ (^^j^-i) > Qpcn,©™- (^""^^ ) • (35) 
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Additionally, from the properties of GEM (Equation (j33p and (j34p ). we have 

i^pen(^°) > l^pcnie') > • • • > i^pcn(e™) > ^^pen(0"^+' ) • (36) 

Equation (f36l) and the converging subsequence imply that the sequence {i'pen{d"^)',m = 
0, 1,2,.. .} converges to fpen(^)- Further, we have 



< Qpen,0'"(^'") — <5pon,6»" 



<o 

< l^pe„(^"^)-i^pen(e'"+'). (37) 
^ ' 

->-t'pon(6')-l'pon(6') = 

We conclude that the sequence {(5pen,S'"(^"^) — Qpcn,s™'(^'"^^)) ^ — 0; l; 2, • • •} converges 
to zero. 

We now show that {9^^^^ — 9^^} converges to zero (j — )• oo). Assume the contrary, in 
particular that {-z™-' — 9"^^} does not converge to 0. Let ^'"^^ = \\z'^^ — 9'"^^]. Without 
loss of generality (by restricting to a subsequence) , we may assume that there exists some 

7 > such that 7"^^ > 7 for all j. Let = ^^-tttj— ^. This differs from zero only 

along the first component. As s™^ belongs to a compact set (||s™^ || = 1) we may assume 
that s™^ converges to si. Let us fix some e S [0, 1]. Notice that < £7 < 7"^^ . Therefore, 
0"ij _j_ ^i^g^j ligg the segment joining 9'^^ and z"''^ , and belongs to because is 
convex. As Qpen,e"^i (') convex and minimizes this function over all values that 
differ from 9"^^ along the first coordinate, we obtain 

< Qpe„.e-.(^"^+e7^r) (38) 
From Equation (135]) and (138]) . we conclude 



lag ^ 



Using ([37]) and continuity of Qpen,xiy) in both arguments x and y, we conclude by taking 
the limit j — )• cxd: 

Qpen,e-(^ + m) = Qpcn,e-W VeG[0,l]. 



Since 7S1 7^ this contradicts the strict convexity of Qp^,^ ei^i^ ^2; • • • , ^d) as a function 
of the first block-coordinate. This contradiction establishes that converges to 9. 
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From the definition of the algorithm, we have 

By continuity and taking the limit j — )• oo, we obtain 



Repeatin g the argume nt we conclude that is a coordinate-wise minimum. Therefore, 
following |Tsen3 (|200lh . 9 is easily seen to be a stationary point of Qpg^ §{•), in particular 



^'pen e^^' ^) — ^ directions 



□ 
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